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Abstract 



In this study, we combine bibliomctric techniques with a machine learning algorithm, the 
sequential Information Bottleneck, to assess the interdisciplinarity of research produced by the 
University of Hawaii NASA Astrobiology Institute (UHNAI). In particular, we cluster abstract 
data to evaluate Thomson Reuters Web of Knowledge subject categories as descriptive labels 
for astrobiology documents, assess individual researcher interdisciplinarity, and determine where 
collaboration opportunities might occur. We find that the majority of the UHNAI team is 
engaged in interdisciplinary research, and suggest that our method could be applied to additional 
NASA Astrobiology Institute teams in particular, or other interdisciplinary research teams more 
broadly, to identify and facilitate collaboration opportunities. 



Keywords: Astrobiology, Bibliometrics, Information bottleneck method, Inter- 
disciplinary science, Machine learning, Text mining 



1 Introduction 

Astrobiology, the study of the origin, evolution, distribution, and future of life in the universe, is 
a relatively new field comprised of researchers from a range of scientific disciplines. Apart from 
its subli me object of s tudy, astrobiology has been identified as a field that can integrat e diverse 
sciences (jStalevl . l20Q3h . provide a tangible target for i nterdisciplinary science educ ation (|Cockell 



and provide a pathway to adult science literacy (jOliver and Fergussonl . l2007h . Many of the 



field's core questions require knowledge from multiple disciplines to be harvested, integrated and 
applied outside of their source domains, and as such, astrobiology is inherently interdisciplinary. For 
example, the University of Hawaii NASA Astrobiology Institute (UHNAI) studies the origin of water 
in the solar system and beyond, in the context of understanding the origins of life. Astronomers, 
chemists, geologists, oceanographers and biologists work together to study data from meteorite 
fragments to comets to the interstellar medium to address the question of where else in the universe 
water, and thus life, might be found. Without collaboration across disciplinary boundaries to 
interpret often-scarce data, important questions in astrobiology will remain incompletely addressed. 
Developing a method to identify, measure and catalyze interdisciplinary work in the astrobiology 
research environment is the goal of this paper. 

One of the benefits of a broad-based research community is that new developments in astrobiol- 
ogy occur fairly frequently. The downside is that researchers must stay abreast of these numerous 
developments both inside and outside of their home fields. As new astrobiology research findings 
are reported, the considerable effort involved in finding, evaluating and integrating them indicates 
a need for a better understanding of how findings in one field might inform others, and to identify 
potential collaboration opportunities between individual researchers working on similar questions. 

Our previous example suggested that knowledge from multiple disciplines is required to under- 
stand the origin of water to answer questions regarding the origin of life. Satisfactorily understand- 
ing the research record of s cientists that work i n this area requires measuring interdisciplinarity 



on an acute scale. Following Ivan Leeuwenl (|2007l ). we distinguish between a top-down bibliometric 



approach, where large-scale trends at the highest levels of publication aggregation are considered 
(such as the research output of a country or university), and prefer a bottom- up approach, where 
we analyze individual documents and the papers they cite. We harvest each astrobiologist's pub- 
lication data by comparing NASA Astrobiology Institute annual reports, where publications are 
systematically documented, with the researchers' Websites and CVs, and further verify the data by 
browsing the author indexes of each database to identify name variations, to represent the research 
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output of each astrobiologist. 

A common method used to examine the potential of collaboration across disciplinary boundaries 
is to interview domain experts , but this method suffers from several limitations, such as sample 
size and subjectivity problems (jZhang et all l201ll ). Furthermore, given that the subject matter of 
astrobiology spans many disciplines, meaningful analysis of the responses would require the knowl- 
edge of an astrobiology polymath. After considering these limitations, we suggest that measuring 
interdisciplinarity should be guided by one or more individuals versed in astrobiology, but whose 
expertise need not span all of its constituent disciplines. Therefore, an unsupervised approach is 
optimal as such methods can find trends in data without prior knowledge of its structure. 

As of 2011, the NASA Astrobiology Institute is comprised of 14 teams spanning ten universities 
in addition to NASA Ames, Goddard, and the Jet Propulsion Laboratory. While a cross-team 
analysis is beyond the scope of this paper, we suggest that our method for measuring researcher 
interdisciplinarity at UHNAI could be extended to other NASA Astrobiology Institute teams, and 
to scientific collaborations more broadly. Furthermore, our method suggests where collaborations 
might productively occur, and allows us to better understand the nature of interdisciplinary scien- 
tific discovery. 

In this pilot study, we investigate the use of an u nsupervised machine learning clustering tech- 
nique, the sequential Information Bottleneck (sIB) ( Slonim et al. . 2002) to aid in measuring re- 
searcher interdisciplinarity. Furthermore, we assess the extent to which Journal Subject Categories 
from the Thomson Reuters Web of Knowledge database suite are sufficient for labelling astrobi- 
ology documents. The clustering and classification of text allow interdisciplinary analysis that 1) 
describes collaboration and the integration of knowledge and 2) draws conclusions that are useful 
to astrobiology researchers by uncovering the underlying structure of research tracks. The results 
of this pilot study will serve to guide a subsequent investigation that will identify collaboration 
opportunities and measure the disciplinary roots across the entire NASA Astrobiology Institute. 

Researchers in astrobiolog y tend to be comfortab l e spe aking in the language of multiple scientific 
disciplines. As suggested in iGargaud and Tirardl (|201lh . these interdisciplinary researchers are 
somewhat isolated from their counterparts in other academic departments. The multidisciplinary 
context given by astrobiology affords an excellent opportunity to examine the methods used to study 
researcher interdisciplinarity and knowledge integration. Furthermore, we propose an iterative 
process to identify specific publications that bridge diverse fields, to facilitate interdisciplinary 
collaborations and ease the cognitive load of a single researcher who wishes to integrate knowledge 
from multiple disciplines. 



2 Background 



Researc h that occurs at th e inter section between disciplines is thought to lead to great advances in 
science (|Porter and Rafoli lioosh . Many funding agencies exist specifically to support and encour- 
age interdisciplinary research; the U.S. Nation al Science Foundation's interd i sciplinary research ef- 

forts span all of their divisions and directorates ([National Science Foundation! . I Accessed November 21. 2011 
For example, some authors measuring int erdisciplinarity lament tha t there is not enough coverage 
of the societal causes for climate change ( Bjurstrom and Polk . 201 ll ) as described in the Intergov- 
ernmental Panel on Climate Change (IPCC) literature. In this specific case, measuring both the 
disciplinary diversity and the integration of knowledge is of paramount importance to ensure that 
future IP CC reports inc lude appropriate factors. A cynical disposition to this problem is eloquently 
stated in iBrewerl "The world has problems, but universities have departments." 

Many important terms in this work have so far been discussed without qualification. The term 
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interdisciplinary tends to be tacit ly understood by re searchers, with no consensus definition. We 



adopt the definition sugges ted by iPorter et al.l (120071 ) , which followed the definition given by the 
National Academies! ( 20051 ): interdisciplinary research requires an integration of concepts, theo- 
ries, techniques and/or data from two or more bodies of specialized knowledge. Multidisciplinary 
research may incor porate element s of ot her bodies of specialized knowledge, but without interdis- 
ciplinary synthesis (jWagner et all l201ll ) that leads to research that is greater than the sum of its 
parts. 



Despite the increase in claimed interdisciplinarity, 



value in assessing and quantifying interdisciplinary research (IMorillo et al. 



traditi onal indicators are of questionable 
' 200lh . Additionally, 



policies regarding interdisc i plina rity are often based more on conventional wisdom than empirical 
studies ( Rafols and Meyer . 2010l ). The usefulness of bibliometric indicators depends critically on 
the level at which we wish to understand the integrative process. For example, funding agencies 
may only require high-level publication co-authorship and collaboration statistics, describing the 
research performed by their grantees and the diversity of their home disciplines, but not addressing 
the essential aspect of synthesis. When there is no mechanism to identify, measure and encourage 
these points of intellectual crossover, there is no way to quantify the extent to which interdisciplinary 
science is taking place. 



Top-down approaches have been used to map scientific literature (for example, see lBovack et al 



(|2005h ;i. and often represe nt broad areas of science with Web of Knowledge (WoK) subject cate- 
gories (SCs). For example, van Raan and van Leeuwen (2002) and Porter et al. ( 2007 ) used SCs in 
their methodology to measure interdisciplinarity. In these studies, SCs have been employed as de 
facto disciplinary boundaries, and as a benchmark to measure how much a given author, journal 
or research area crosses scientific fields. Unfortunately, low- level conclusions that might inform 
potentially productive individual collaborations cannot be made when relying on these top-down 
approaches, as they focus on past outputs rather than future integration. Conversely, bottom- 
up bibliometric approaches incorporate the authors' own words, in free-text fields such as: titles, 
abstracts, keyword^ and the full text of a document. Clustering bibliometric data at this level 
can describe the structure of a researcher, journal or an entire field, and suggest productive fu- 
ture directions. Comparing the bottom-up clustered output with the top-down approach of SCs 
for astrobiology publications yields an indi cation o f the e ffectiveness of SCs as document labels 
for works in this interdisciplinary domain. Kostoff ( 19981 ) describes how a citation analysis can 
serve as a "radioactive trace" of research impacts. One limitation of cluster analysis i s that ".. .pre - 
cise disciplinary divisions are not obtained, rendering inter-cluster links misleading" (ISmalll . 
but Upham and Small ( 2010l ) propose a methodology to identify emerging "research fronts", highly 
cited micro-specialty areas that transcend existing fields. Their method requires that the researcher 
not presuppose the existence of any research field, but to rely instead on a comprehensive mon- 
itoring of citations to identify points across disciplines where research interests intersect, echoing 
one goal of the present s tudy. Both top-down an d bottom-up approaches are useful in different 
applications. A study by Rafols and Meyer ( 2010l ) combines bottom-up and top-down approches 
to measure both disciplinary diversity and knowledge integration. 

Measuring scientific output in bibliometric terms requires some degree of integration and nor- 
malization of the publication records of researchers, which are published in diverse formats, venues 
and scholarly traditions. The publication record generally includes data such as departmental 
affiliations, keywords, year of publication, journal, cited references, and the abstract and/or full 
text of the publication. This data can be compared using various bibliometric techniques to assess 
interdisciplinary research. While bibliometric studies tend to rely on a citation analysis, such an 



1 Keywords are not always a free-text field. 
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analysis may not be appropriate for every discipline or field. For example, a given field may tend 
to reference conference proceedings, websites, newspapers, or colloq uia which are no t as conducive 
to a co-citation analysis as journal articles. Due to this observation, ISug imotol (|201lh suggests that 
studying interdisciplinarity should include publications beyond journal articles. While we agree 
with this position, it happens that journals are the preferred method of communication within the 
great majority of the fields that compose the UHNAI team; therefore, the present study is not 
hindered by this limitation. 



3 Methodology 

In this section, we outline our method for measuring interdisciplinary research. In the previous 
section we noted that particular bibliometric indicators are conducive to understanding research 
at varying levels. One of the goals of this research is to uncover the underlying structure within 
an astrobiology research team that undertakes interdisciplinary projects at the macro scale, but 
may differ in the extent of interdisciplinary work at the micro level. To understand the research 
structure, we examine the abstract text of research publications and employ a method from the 
field of information theory, the sIB method, to cluster our high dimensional abstract data. 

An advantage of using WoK for bibliometric studies is that it provides a mapping of SCs to 
each journal. Given the incommensurability of other bibliometric data (for example, journals do not 
agree up on a common set of keywords), SCs provide a way to compare publications on the journal 
level. In lZhang etHI (|2O10h . the authors used a cross-citation analysis to create seven high-level 



clusters of related SCs, though their analysis was somewhat c onfounded b y the "idiosyncrasy" 
that a journal may be assigned to multiple SCs in WoK. In iPorter et al. (|2007h . the authors 



examine the references in sets of journal articles gathered from WoK, and relate the journals to 
their corresponding SCs. In this approach, a more diverse set of SCs that represent a paper derived 
from its references indicates a higher degree of interdisciplinarity than a set of similar SCs that 
represent a paper. 

Using the r eferences of a paper is a reasonable approach to measure researcher interdisciplinarity. 
Analoguous to lPorter etHl (l2007h . we use the references in each UHNAI publication. In particular, 



we combine all of the abstracts of all of the references cited by a UHNAI publication, and use these 



aggregated abstracts to represent each publication. In another text mining study (IKostoff et al 



20011 ). employed free-text fields (such as title, keywords and abstracts) of cited/citing publications 
in combination with phrase frequency analysis and phrase clustering analysis to obtain a low-level 
understanding of research impact and interdisciplinary research. 

In the present study, we focus on the abstracts of cited papers only, and we do not consider the 
papers that cite the UHNAI papers. A major limitation of studying the citations to the UHNAI 
papers is that it would require the database to contain those papers that cite a particular work, 
which varies between disciplines, fields and databases. The same is true of those references that 
are cited in our UHNAI papers. To obviate this problem, we elect to use the NASA Astrophysics 
Data System (ADS) to collect the majority of our abstracts, as more UHNAI team publications are 
covered in this database than any other. The extensive coverage in ADS ensures that a considerable 
majority of papers referenced by the UHNAI team are also within the database. However, previous 
research has illustrated how the differences in scientific publication patterns between fields often 
require that records from multiple databases b e harvested to encompass th e output of interdisci- 
plinary scientific researchers (see, for example, Kousha and Thelwall . 20081 ). For UHNAI authors 



whose publications were not sufficiently represented in ADS, we used WoK to obtain their publica- 
tion data and cited references. As it turns out, those authors, and the papers they cite, were highly 
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represented in WoK. We were able to gauge author coverage in ADS and WoK by consulting the 
CV of each UHNAI team member. 

In the following subsections, we describe our methods used to achieve the following goals: 



• Examine whether WoK SCs are sufficient for labelling astrobiology documents. While we 
believe SCs are useful in mapping broad scientific research trends in established disciplines, 
whether they are appropriate for classifying individual publications or interdisciplinary works 
remains an open questional We cluster a corpus of astrobiology abstracts labelled with their 
corresponding conflated SCs (Section I3.3[) . and assume that if a given cluster is comprised 
mostly of a single SC, then SCs are a sufficiently accurate classifier. 

• Identify actual and potential instances of interdisciplinary research in astrobiology using con- 
flated SCs (Section I3.3f> . 

• Identify actual and potential instances of interdisciplinary research and identify potential 
collaboration opportunities between researchers using aggregated abstracts to represent the 
research tracks of the UHNAI team (Section 13. 4p . 



3.1 Text Mining and the Sequential Information Bottleneck (sIB) Method 

The sIB clustering algorithm ( Slonim et all 20021 ) is employed to cluster our datasets described 
below. We chose this clustering method over others because it has been shown to perform better 
than other unsupervised clustering methods, such as k- means (jSlonim et all l2002h . Furthermore, 
the approach should allow us to identify instances of interdisciplinary research by examining the 
cluster membership of our abstract data without prior knowledge of the data's properties. It 
is necessary to use an unsupervised clustering method because a canonical set of astrobiology 
documents with which to train a clustering technique does not exist. 



3.2 Data Collection 

We gather publications by the UHNAI team members from 2001 until June 2011. Publications 
earlier than 2001 were not collected, as many researchers may not have been engaged in astrobiology 
research, and the UHNAI team was not yet foundecH. However, we place no age restrictions on the 
papers that they cite. 



3.2.1 NASA Astrophysics Data System (ADS) 

The ADS has extensive coverage of astronomy, astrophysics and physics journal articles, pre-prints 
and conference proceedings. We gather the data in a semi-automated fashion. Instead of accessing 
the articles through a web browser, ADS has a perl script librar)Q that can be used to access parts 
of the database. To gather the abstracts and journals of UHNAI papers, and the abstracts and 
journals of the papers they cite, we employed the following procedure: 

2 The classification of documents is a requirement for an astrobiology publication information retrieval system. Our 
research group is inclined to create such a system. See http : //airframe . ics .hawa ii . edu/| for more information. 

3 This is also the year that the journal Astrobiology began publication. While astrobiology research was, and 
continues to be published in other journals, this indicates that astrobiology research may not have coalesced as a field 
prior to 2001. 

4 The scripts can be found here: http://vo.ads.harvard.edu/adswww-lib/ 
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• Ran one of the ADS perl scripts to return a list of all of the publications for each UHNAI 
team member. This returned a list of ADS bibcodes, which uniquely identify each record in 
the ADS. 

• Compared these papers with each author's CV to ensure that we did not collect undesired 
articles. For example, we filtered out papers by authors with the same last name as members 
on the UHNAI team. 

• Used the ADS bibcodes to create a script that goes to the URL of the webpage that lists the 
references in each UHNAI paper. We download the individual webpages. 

• Created and ran a script to capture all of the ADS bibcodes in each downloaded html webpage. 

• Used this list of bibcodes to get the abstracts and journals of all of the UHNAI papers and 
references therein using the ADS perl scripts. 



3.2.2 Web of Knowledge (WoK) 

To include the published output of UHNAI authors whose work is underrepresented in ADS, and to 
provide a comprehensive portrait of the entire UHNAI team, we also used WoK to gather abstracts 
and bibliographic data. To our knowledge there is not an API or alternative way to access WoK 
other than using a web browser. To gather this data, we employed the following procedure: 

• Created a list of all of the papers authored by UHNAI authors that were not in or underrep- 
resented in the ADS database. 

• Manually downloaded the html pages of each record describing each UHNAI publication and 
references therein. 

• Created a script that parses the html pages to harvest the abstracts and journal titles. 

When working with WoK data, it is important to be mindful of the differences in institu- 
tional subscriptions, which include access to different subsets an d date ranges of WoK's constituen t 
databases, and may affect the results of bibliometric analysis ( Derrick et al. . 2010 : Jacsdl . 2005). 



Therefore, we provide a list of the University of Hawaii WoK subscriptions at the time of data 
collection: 

• Web of Science, 1980 - 

• Biological Abstracts, 1969 - 

• Medline, 1950 - 

• Journal Citation Reports Science & Social Science editions, 2004 - 



3.3 WoK Subject Categories and Document Classification 

Having collected the abstracts and journal names of UHNAI publications and references, we create 
a dataset that contains the abstract text and the SC of the associated journal for each UHNAI 
publication and the publications they cite. 

Many of the SCs of the papers in our dataset were si gnificantly underrep resented. Furthermore, 



;epi 

as other researchers have encountered (see, for example. IZhang et all l2010l ) some journals in WoK 



are assigned multiple SC s, necessitating some conflation into superclusters, or "macro-disciplines" 
(jPorter and Rafolsi . l2009h . We modify the SCs using the following method: 
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• Journals with a single WoK SC that appears 10 or more times in our dataset uses the assigned 
WoK SC name. 



• Journals with a single WoK SC that appears less than 10 times is changed to a broader 
WoK category (e.g. "Biochemical Research Methods" becomes "Biochemistry & Molecular 
Biology"). 

• Journals with two or more SCs of roughly equivalent weight are assigned a new conflated SC 
(e.g. "Astrophysics & Geophysics"). 

• Journals with two or more SCs that have a clear primary SC have "-Multidisciplinary" ap- 
pended to the primary name. 

The ADS system also contains non-journal publications. In these instances, we manually as- 
signed an appropriate SC to the publication. Table [1] shows the mapping of WoK SCs to our 
conflated SCs. 

We eliminated those abstracts whose SCs were unique or constituted a very small fraction of the 
entire dataset. Furthermore, publications are commonly cited across the UHNAI team; therefore, 
multiple duplicate abstracts could appear in our dataset. Duplicate abstracts were removed from the 
sample. Once removing the duplicate abstracts, we traced back which databases these abstracts 
came from to reflect the relative proportion of abstracts obtained through WoK and ADS. The 
dataset has 10216 abstracts integrated over 13 conflated SCs. Table [2] shows the number and 
fraction of abstracts labeled with each conflated SC. From Tabled we observe that there is a large 
class imbalance problem, as the Astronomy & Astrophysics SC contributes 67.68% of the entire 
dataset. 

We oversample the minority classes (where a class is a SC), which is every SC but Astronomy &: 
Astrophysics, to examine if the class imbalance problem significantly affects the resultant clusters. 
There are a number of methods utilized to oversample minority classes in the field of data mining. 
Duplicating the abstracts in the minority classes could potentially result in model overfitting. 
To obviate this problem, we create synthetic data that is similar to the other a bstracts within 



a giv en SC. We use the Synthetic Minority Over-sampling Technique (SMOTE) (jChawla et al. 



2002) to produce synthetic feature vectors, where a feature vector (or feature) is a normalized 
numerical representation of the words that describe each abstract/instance. For example, consider 
the following two truncated abstracts: 

1. Water is found on the earth and in the solar system. 

2. Water exists on the moon, and Mars. 

The two feature vectors of word counts after punctuation is removed is shown in Figure [TJ 

For clustering our abstracts and their corresponding conflated SCs, we create two datasets. 
In the first dataset (hereafter conflated_SC-default), we cluster the dataset as described by Ta- 
ble El without considering the class imbalance problem. In the second dataset (hereafter con- 
flatedSC 'sampled), we randomly sample without replacement 25% of the features contained within 
the Astronomy & Astrophysics SC and every feature in the minority SCs three times. We use 
SMOTE to create synthetic feature vectors for the minority SCs such that each SC is represented 
by the same number of features. A visual representation of the distribution of real and synthetic 
data is shown in Figure [2j 
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3.4 Text Mining Aggregated Abstracts 

We create a dataset of aggregated abstracts (hereafter aggregated-abstracts) for the purposes of 
representing each UHNAI publication. The dataset contains 731 publications by the UHNAI team. 
Table [3] shows the team members and their associated home disciplines. Each publication is rep- 
resented by its own abstract and the abstract of each cited publication. We aggregate all of these 
abstracts in a single feature vector to represent each UHNAI publication. Non-journal publica- 
tions such as book chapters, conference proceedings and dissertations were included in the dataset, 
although they constitute a very small fraction of the total publications. A majority of the ab- 
stracts in the aggregatecLabstracts dataset are the same as the ones in the conflatedSC Ldefault and 
conflated_SC sampled datasets. 

We estimate the completeness of the aggregated abstracts, which is defined as the fraction of 
abstracts harvested out of the total number of citations in a UHNAI publication. For example, 
if an individual UHNAI publication contains 20 referenced citations, and 15 corresponding cited 
abstracts were harvested, then the aggregated abstract is 75% complete. We randomly sampled 
(iV=100) abstracts from the 731 in the aggregated_abstracts dataset. We find that the average 
completeness for the aggregated abstracts in this sample is 74.3%, as shown in Table HI Therefore, 
we expect that on average our aggregated abstracts in the aggregated_abstracts dataset are ~74% 
complete. Interestingly, the UHNAI publications harvested from ADS have a higher degree of 
completeness than those abstracts harvested from WoK. 



3.5 Preprocessing of the Datasets 

We preprocessed the conflatedSC -default, conflatedSC sampled and aggregated-abstracts datasets 
in the same manner. Our preprocessing of the datasets included converting uppercase words to 
lowercase, and ignor ing non-alph abetical characters. We stemmed the words using the Porter 
stemming algorithm ( Porter . 1980l ) to ensure that related words were not duplicated in the datasets. 



We created a stopword list to remove formatting tags, and other non-content-bearing terms. We 
selected words which had a minimum frequency of 12, integrated over the entire datasets, resulting 
i n a total of ^4000 word s in each dataset. Most of our preprocessing was performed in WEKA 



(jWitten and Frank! . 120051 ). and the sIB method was also executed in this environment. 

We normalize each feature vector in our datasets. Each feature is described by the term fre- 
quency of each word found in the ~4000 words distilled from their respective datasets. We normalize 
the sum of each feature vector to 1. In the case of the aggregated abstracts, some feature vectors 
will be much shorter or longer than others, as there is a large range of abstract sizes, and number of 
references within a given publication. If we did not normalize the term frequencies, then instances 
with high word or low word counts may cluster together. Such clusters would be less revealing of 
the content of the documents themselves. Figure [3] reiterates the steps employed to construct our 
datasets. 



3.6 Limitations 

There are several limitations to our study. First, some of the papers were authored by multiple 
members of the UHNAI team. In this case, we assigned the abstract data to the first-listed author on 
the paper, thereby not fully characterizing the research contribution of the non-primary authors. 
Otherwise, having multiple labels on the same document would inadvertently oversample those 
documents with multiple UHNAI authors. Also, there is a minor discrepancy between the abstracts 
gathered in ADS and WoK; ADS contains abstracts from non-journal sources, whereas WoK does 
to a lesser extent for the researchers studied here. The vast majority of our data was from journal 
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articles; therefore, we do not expect this to have a significant, if any negative impact on our study. 
WoK maps multiple SCs to a single journal. While we need to conflate the SCs in order to compare 
them to clusters (in the conflatecLSC-default and conflatedSC sampled datasets), the aggregation 
procedure undermines the fundamental function of SCs. Furthermore, we reduced the total number 
of conflated SCs to 13 which may have a negative effect on our ability to assess interdisciplinary 
research. 

4 Results 

In this section we present the results of our text mining experiments. For the purposes of this 
paper, where our goal is to identify actual and potential instances of interdisciplinary research in 
astrobiology, a meaningful cluster relationship is one where papers from two or more SCs cluster 
together, or when researchers from different fields have the aggregated abstracts of their papers 
cluster together. Our present and future work is focused on these heterogeneous clusters, however 
our method could be used for a variety of purposes, each with a different corresponding indicator of 
interest. For example, a research team wishing to demonstrate its uniqueness within a collaboration 
might highlight its work being represented as a relatively homogenous cluster, with its dominant 
SC not found in other clusters. A group seeking to align or connect itself with researchers in a 
particular area might target clusters where their work and that of their target domain co-exist. 

4.1 Subject Categories as Document Labels 

We begin by estimating the extent to which conflated Web of Knowledge Subject Categories ac- 
curately describe the content of astrobiology publications. In Figure H] we visualize the results of 
clustering the abstract data before sampling as described in Section [3731 The same data is presented 
numerically in Table 5 in the online supplement. If SCs accurately reflect shared topical content 
of documents assigned to them, when the abstracts are clustered we should expect each SC to 
be primarily assigned a single cluster. However, when abstracts are assigned one of five clusters 
(Figure Hi-top panel), we observe that the cluster membership for most SCs is heterogeneous: there 
is no clear correspondence between a cluster and a single dominant SC. Even the most common SC, 
Astronomy & Astrophysics, is primarily distributed across the first three clusters, but is represented 
in all five. 

Table 5 does suggest some areas in which SCs may be more appropriate document labels. For 
example, Oceanography appears in only one cluster, and the Multidisciplinary Sciences SC is fairly 
evenly distributed across four of the five. However, when increasing the number of clusters to 10, 
15, and 20 (Figure H]), the heterogeneity of SCs within an individual cluster becomes even more 
pronounced. 

The dominance of the Astronomy & Astrophysics SC in the conflatedSC '-default dataset sug- 
gests that we also examine the cluster relationships after the dataset has been sampled. Figure [5] 
shows the distribution of SCs in 5 clusters over 3 trials, where the cluster results of each trial are 
not related to each other. For example, in successive trials, the same abstract may be assigned 
to different clusters; for the Astrophysics & Geophysics SC, each trial results in different cluster 
assignments, though the overall distribution of clusters is roughly equal, suggesting that there is 
little variability between trials. 

We present the data in Figure as a fraction of the total number of features in the con- 
flatedSC sampled dataset, where the number of features representing each SC are equal (Figure [2]). 
Furthermore, the data in Table 6 in the online supplement displays the data somewhat differently, 
where the fraction of features in each SC that are found in a given cluster is presented. Examining 
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the data with the tables provided in the online supplement makes interpreting the results easier in 
some instances. In Figure[5]and Table 6, the Astronomy, Biochemistry & Microbiology, and Physics 
SCs consistently cluster with their Multidisciplinary counterparts. Therefore, on the 5 cluster level, 
SCs seem to reasonably classify individual publications. 

Since five clusters may not be sufficient to reflect the diversity of content within astrobiology, we 
increase the number of clusters in subsequent trials. One would intuitively expect more SC hetero- 
geneity within each cluster; however, increasing the number of clusters also allows more potential of 
each SC to dominate a single cluster. When we increase the number of clusters to 10 (Figure El Ta- 
ble 7), we find that most of the SCs disperse into multiple clusters. One way to interpret this result 
is that more clusters allow finer distinctions between content to be revealed. For example, Physics 
and Physics-Multidisciplinary, which cluster together in each trial at the 5 cluster level, tend to 
cluster separately at the 10 cluster level. However, the Biochemistry & Microbiology SC and its 
multidisciplinary variant continue to have their abstracts cluster together. Moreover, from FigureEJ 
we observe that the abstracts in the Biotechnology & Applied Microbiology-Multidisciplinary SC 
consistently cluster together. At the 10 cluster level, more clusters contain single dominant SCs 
than at the 5 cluster level. 

Figure [3 (Table 8) and Figure M (Table 9) present the results of clustering the abstracts into 15 
and 20 clusters, respectively. We observe that many of the SCs are found distributed in multiple 
clusters. For example, at the 20 cluster level, what had been homogeneous cluster membership in 
the Biochemistry SCs at the 10 cluster level is split into three or more clusters, neither of which 
is shared across any other SC. Therefore, at these clustering levels, we operationalize a dominant 
SC within a cluster as one that either constitutes 50% or more of the abstracts alone, or one that 
is within 50% of the size of the most common ScH. By this approximation, the results at the 10 
cluster level hold: as a group, the Biochemistry and Biotechnology-related SCs dominate the fewest 
clusters; the Astronomy, Oceanography and Physics group slightly more, and the Geochemistry and 
Geophysics SCs are again the most diverse, short of the Multidisciplinary Sciences SC. Overall, at 
the 10 cluster level, more clusters contain single dominant SCs than at the 5, 15 or 20 cluster levels, 
and the usefulness of SCs as document labels reaches a relative maximum. 

In some cases, the trial processes reveal some inconsistencies in the cluster membership of 
SCs. For example, in the Biotechnology & Applied Microbiology-Multidisciplinary SC, one would 
expect to have diverse membership at the 15 cluster level (Figure [7]). However, the Biotechnology 
& Applied Microbiology-Multidisciplinary SC is dominant in one cluster in trials 1 and 3, and 
is dominant in three clusters in trial 2. While these results may be an artifact of the sampling 
and multiple-trials processes, we would expect and find that the two related SCs, Biochemistry 
& Molecular Biology and Biochemistry & Molecular Biology-Multidisciplinary are found mostly 
within the same clusters. This observation also holds for the Geochemistry & Geophysics and 
Geochemistry & Geophysics-Multidisciplinary SCs. The multidisciplinary SC variants (BioChem 
& MBio, BioChem & MBio-M and GeoChem & GeoPhys, GeoChem & GeoPhys-M) are slightly 
more diverse than their associated core SC, but there is a high degree of similarity between the 
abstracts in these two sets of related SCs. Therefore, we conclude that even with some observed 
inconsistencies, the clusterer is collocating related abstracts across related SCs. 

Certain related SCs tend to consistently cluster together, which suggests that SCs are sufficient 
for characterizing astrobiology publications. However, other SCs have a limited effectiveness as 
document labels in this interdisciplinary domain, as some SCs did not map well to successively 
smaller cluster sizes. Therefore, our results suggest that WoK SCs may not consistently reflect the 

For example, a cluster with SCs constituting 30%, 18%, 16% and 12% of the abstracts would have three dominant 
SCs. 
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diverse content of astrobiology publications. 

4.2 Utilizing Subject Categories to Assess Interdisciplinarity 

In this section, we attempt to leverage the heterogeneity of SC cluster membership to assess the 
interdisciplinarity of astrobiology publications, and analyze only the sampled data to de-emphasize 
the dominance of the Astronomy & Astrophysics SC in the dataset. Furthermore, many observa- 
tions are similar to those discussed in Section 14, H therefore, we will only mention in brief possible 
interdisciplinary connections that can be found by utilizing SCs. 

In Figure the clustering technique correctly collocates obviously similar SCs across all trials, 
but also identifies some less obvious potential interdisciplinary connections. For example, across all 
trials in Table 6, 9% of the abstract data in Biochemistry & Molecular Biology-Multidisciplinary 
clusters with 11-14% of the abstracts in Geochemistry & Geophysics. Given a document corpus 
of relatively equal SC distribution, as we have approximated here by the sampling process, these 
results suggest that papers from different SCs that consistently cluster together should be targeted 
for investigation by researchers interested in potential connections between the two fields. 

Across all three trials at the 10 cluster level in Figure a single clearly dominant SC could be 
identified in 27 of the 30 clusters. The Astronomy, Oceanography and Physics SCs demonstrated 
somewhat less monodisciplinary dominance at the 10 cluster level; all had roughly 20% of their 
abstracts assigned to other clusters. The Geochemistry & Geophysics and Environmental Sciences 
SCs demonstrated the most diversity apart from the pure Multidisciplinary Sciences SC, though 
somewhat surprisingly, the Geochemistry & Geophysics-Multidisciplinary SC appeared in fewer 
clusters than its core SC. 

These findings yield several possible interpretations and applications. We would expect all astro- 
biology researchers to publish and cite primarily within their home disciplines, but as these results 
suggest, the norms of disciplinary diversity vary by field. A potential application of this approach 
is a field-specific baseline metric of interdisciplinarity, a method by which an individual's research 
output can be compared to others in the same field in terms of the potential interdisciplinary ap- 
plicability of their work. This process could also result in an aggregate metric of interdisciplinarity 
for research teams via their past published work, while addressing the primary goal of discovering 
latent connections between the work of diverse researchers for the present and future. 

Analyzing the heterogeneous cluster membership of publications from diverse SCs is one way 
to assess interdisciplinary research possibilities, but the probabilistic nature of this method should 
be emphasized. A heterogeneous cluster could indicate that SCs are poor document labels, or that 
the clustering level should be adjusted to better match the data and metadata, or that a potential 
interdisciplinary relationship exists. In either case, this process could inform targeted, iterative 
investigation. 

4.3 Text Mining the Aggregated Abstracts 

The sIB technique was employed to cluster the abstracts of the publications by the UHNAI team 
and the references within these publications. Figure [9J shows the results of clustering the data into 
5 clusters. The results indicate that authors from their respective home disciplines cluster together 
(see Table[3]for the list of authors and their respective home disciplines). For example, the geologists 
Krot, Keil, Huss, Scott, and Jogo are strongly represented in cluster 4. One exception is Taylor 
(geologist) who clusters with the oceanographers (Cowen and Mottl). Additionally, Schorghofer (an 
astronomer by departmental affiliation) also clusters with the oceanographers. Furthermore, the 
astrochemists (Bennett and Kaiser) have all of their publications in cluster 1. This result suggests 
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that the sIB technique is able to cluster similar research on a high-level; however, utilizing more 
clusters should provide a lower-level view of overlap in research interests between the authors. 

When running the sIB technique for 10 clusters, we begin to see where researchers may find 
potential collaboration opportunities, and we observe which authors have specialized or broad re- 
search interests. Research can be specialized but still integrate methods, techniques and data from 
multiple disciplines. We believe that an author who is represented primarily in a single cluster 
may not be engaging frequently in interdisciplinary research, or may be focusing on narrow re- 
search problems, or using similar research methods or equipment. In Figure \W\ we see that the 
two astrochemists (Bennett and Kaiser) are entirely represented by cluster 8, consistent with the 
results presented in Figure EE We know that their research is heavily influenced by their experi- 
mental apparati, thus suggesting that the experimental methods and apparati significantly affect 
the description of a research track. Interestingly, Schorghofer's research is on various planetary 
bodies such as Mars and the Moon, which is also true of Taylor. Therefore, clustering the text of 
the aggregated abstracts sufficiently illuminates similarities in research tracks across disciplinary 
boundaries, in this case, between astronomy and geology. 

In Figure [TTT we observe that Huss, Jewitt, Krot and Meech's research is found in many clusters. 
This signifies that their research is likely to be very interdisciplinary. With regards to those authors 
represented by a few clusters, we cannot conclude that their research is absolutely mono-disciplinary, 
as it may be very specialized, or utilize the same methods or apparati. However, we believe that 
those UHNAI authors with publications in multiple clusters are more likely to be engaged in 
interdisciplinary research. In Figure \12\ we observe that of the senior (non-postdoctoral fellows) 
astronomers (Reipurth, Meech, Jewitt, Haghighipour, Owen, Schorghofer) half (Meech, Jewitt, 
and Owen) are fairly diverse in their research interests and the other half (Reipurth, Haghighipour, 
Schorghofer) are engaged in specialized or mono-disciplinary research. 

These results suggest that the sIB method, in combination with aggregated abstracts, can illu- 
minate areas of implicit commonality where the research areas of scientists from diverse disciplines 
overlap. Furthermore, while clusters do not inherently relate any information about a researcher's 
discipline, it is clear that researchers from the same department often cluster together. Therefore, 
we expect that performing a similar analysis on the entire NASA Astrobiology Institute will show 
where collaborations between researchers can occur, and can assist NASA with outlining research 
priorities. These results can serve as the framework for a geospatial visualization of common yet 
unconnected research tra c ks an d potential collaborators, similar to the "hot regions" described by 
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5 Discussion and Conclusions 

We clustered astrobiology abstract data to evaluate SCs as document labels. We attempt to recon- 
cile clustering (bottom-up approach) with pre-defined categories (top-down approach). The clusters 
produced by text mining the abstract data did not generally correspond well to the SCs. There- 
fore, we conclude that SCs are not well suited to the classification of astrobiology publications, 
and speculate that this may also be true for other interdisciplinary fields. One explanation is 
that astrobiology research outputs cite mono-disciplinary and interdisciplinar y publication s which 
may prevent SCs from forming cohesive clusters. Additionally, as discussed in ISmalll (|2010h . many 



journals publish highly diverse content, which no journal-level classification system could represent 
completely. The class imbalance problem in our dataset requires us to explore utilizing an oversam- 
pling technique. While we believe that the method remedies the skewed distribution of conflated 
SCs in our dataset, performing a text mining clustering analysis on a balanced astrobiology dataset 
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without oversampling may produce different results. That is, SCs may be more accurate when 
the distribution of SCs is uniform without oversampling using synthetic data. Nonetheless, the 
distribution of departmental affiliations of the UHNAI researchers is skewed, which affects the dis- 
tribution of publications across different SCs; it is likely that this scenario will be consistent with 
the other NASA Astrobiology Institute teams. 

Our results suggest that 10 clusters may be the most appropriate level at which to analyze 
the astrobiology collection (Figure [6j) . Too few clusters and the interdisciplinary diversity of the 
source documents is not well represented; too many and they may be oversegregated, lessening 
the chance to identify potential commonalities in documents from different disciplines and SCs. 
We suggest that when documents from different SCs cluster together, this may indicate implicit 
interdisciplinary connection, where knowledge in one field might inform another. Having researchers 
from the constituent disciplines evaluate these common documents may provide one mechanism by 
which interdisciplinary science can take place, and provide a starting point for potentially productive 
interdisciplinary collab orations . 

Similarly, text mining the aggregated abstracts using the sIB method is also suited to the task 
of finding collaboration opportunities. Our experiments consistently showed that authors from the 
same academic department tended to have their publications cluster together. If this were not 
the case, we would be unable to make any claims regarding the similarity of publications within 
a given cluster. We suggest that authors whose publications cluster together could collaborate 
productively. An author that has publications in many clusters indicates that they are engaged in 
interdisciplinary research, or perhaps that they are not, but should be. Those authors with few 
publications were either underrepresented in WoK and ADS, or were post-doctoral fellows at the 
UHNAI. We find that the majority of publications by UHNAI investigators and post-doctoral fellows 
appear in multiple clusters, providing evidence of actual or potential interdisciplinary research. This 
is an encouraging result, as promoting boundary-crossing scientific research is one of the goals of the 
NASA Astrobiology Institute. Younger generations of researchers will need to synthesize techniques 
from multiple disciplines to answer some of the most fundamental questions in science in general, 
and astrobiology in particular. 

We insinuated that a strong conclusion cannot be made regarding those authors that are strongly 
represented in a single cluster. Research in this context is either: 1) interdisciplinary but specialized, 
perhaps incorporating a synthesis between methods, techniques and data from multiple disciplines, 
but with a narrow scope or 2) mono-disciplinary. Distinguishing between these two cases requires 
studying the individual works in each cluster. Additionally, such an analysis would lead to nar- 
rowing the scope of collaboration between two or more researchers that are found within a single 
cluster. This analysis will be explored in future work. 

The context of the interdisciplinary field of astrobiology has permitted us to explore a method of 
measuring interdisciplinarity, and identify potential collaboration opportunities. We find that most 
of the UHNAI team are engaged in interdisciplinary research, and that our method suggests where 
productive interdisciplinary collaborations could occur. We believe our method, which combines 
bibliometrics and machine learning, makes valid predictions, based on our a priori knowledge of the 
structure of the research team and those intra-team collaborations that currently exist. Bibliometric 
studies of interdisciplinarity can benefit when augmented with machine learning algorithms, in an 
attempt to understand the fine-grained details of interdisciplinary research. 
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Figure 1: Depiction of feature vectors as constructed from abstract data. 
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Figure 2: The distribution of real (black) and synthetic (white) data in the conflatedSC 'sampled 
dataset. 
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Figure 3: An overview of the steps in our methodology. The UHNAI authors and publications to 
be harvested are are shown as region A and outlined in Section [3j The motivation for using the 
sIB is discussed in Section 13.11 The data collection procedure is discussed in Section 13.21 and is 
described by region B. The method for creating our conflated subject categories is shown as region 
C and discussed in Sections 13.31 and [331 The methpd for creating our aggregated abstracts is shown 
as region D and discussed in Sections 13.41 and 13.51 
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Figure 4: The results of clustering the conflated_SC -default dataset. Results are given for 5, 10, 15 
and 20 clusters from top to bottom. 
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Figure 5: The results of clustering the con flated}§C sampled dataset in three separate trials. Each 
abstract is assigned one of 5 clusters. 
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Figure 6: The results of clustering the conflated_SC_sampled dataset in three separate trials. Each 
abstract is assigned one of 10 clusters. 
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Figure 7: The results of clustering the conflatedJELCsampled dataset in three separate trials. Each 
abstract is assigned one of 15 clusters. 
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Figure 8: The results of clustering the conflated2$Csampled dataset in three separate trials. Each 
abstract is assigned one of 20 clusters. 
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Figure 9: Clustering the aggregated_abstracts dataset using 5 clusters. This plot ensures that we are 
not obtaining extremely unlikely correlations and shows that researchers from the same academic 
department are largely clustering together. For example, Bennett is a post-doctoral fellow working 
with Kaiser; they often publish together and their aggregated abstracts are clustering entirely in 
cluster 1. As another example, the geologists/geophysists Krot, Keil, Huss, Scott and Jogo are all 
strongly represented in cluster 4. The one exception is Taylor, who appears to be clustering more 
strongly with the two oceanographers (Cowen and Mottl). As expected, researchers have the most 
in common with those in their home discipline. 
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Figure 10: Clustering the aggregated_abstracts dataset using 10 clusters. The two oceanographers 
(Cowen and Mottl) have all of their papers clustering together in Cluster 5. The same is true 
of Bennett and Kaiser (Astrochemistry). In the previous figure, Taylor was clustering with the 
oceanographers. However, we can see here that Taylor's work is similar to that of Schorghofer's, 
despite their membership to different home disciplines (Geology, and Astronomy respectively). 
Rather striking is the mono-disciplinarity regarding Reipurth's research. 
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Figure 11: Clustering the aggregated-abstracts dataset using 15 clusters. In this figure, Bennett 
and Kaiser are no longer entirely represented by a single cluster. When we utilized 5 and 10 
clusters, Binsted and Gazan (Computer Science) and Freeland (Biology) had their publications 
cluster together. We know in particular that the research by the computer scientists is likely 
to be the most dissimilar to all authors from other home disciplines. However, when clustering 
with 15 clusters, we observe Binsted's research depart from the cluster that contains Gazan and 
Freeland's research and that the research has a tangential relation to research produced by other 
team members. 
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Figure 12: Clustering the aggregated_abstracts dataset using 20 clusters. If we assume that member- 
ship to many clusters indicates a high degree of interdisciplinarity, Huss is the most interdisciplinary 
UHNAI team member. Of the senior Astronomers (Reipurth, Meech, Jewitt, Haghighipour, Owen, 
Schorghofer) half (Meech, Jewitt, and Owen) are fairly diverse in their research interests, or engage 
in IDR, and the other half (Reipurth, Haghighipour, Schorghofer) are engaged in specialized or 
mono-disciplinary research. As younger researchers, the UHNAI post-doctoral fellows appear to be 
engaging in interdisciplinary research. 



26 



Table 1: Mapping of WoK subject categories to conflated subject categories. 



Conflated Subject Category 



WoK Source Subject Category 



Agricultural Engineering-Multidisciplinary 



Agricultural Engineering with Fisheries 



Astronomy Sz Astrophysics 



Astronomy Sz Astrophysics 



Astronomy Sz Astrophysics-Multidisciplinary 



Astronomy Sz Astrophysics-Multidisciplinary or 
Astronomy Sz Astrophysics with: Mechan- 
ics; Engineering-Aerospace; History; Multidis- 
ciplinary Sciences; Physics; Spectroscopy 



Astrophysics Sz Geophysics 



Astronomy Sz Astrophysics with: Geochemistry 
Sz Geophysics; Geosciences-Multidisciplinary 



Biochemistry Sz Molecular Biology 



Biochemistry Sz Molecular Biology 



Biochemistry Sz 
Multidisciplinary 



Molecular Biology- 



Biochemistry Sz Molecular Biology with: Bio- 
chemical Research Methods; Chemistry, Ana- 
lytical; Biotechnology Sz Applied Microbiology; 
Mathematical Sz Computational Biology; Biol- 
ogy; Biophysics; Cell Biology; Computer Sci- 
ence, Interdisciplinary Applications; Genetics 
Sz Heredity; Medicine, Research Sz Experimen- 
tal; Chemistry, Analytical; Chemistry, Medici- 
nal; Chemistry, Organic; Pharmacology Sz Phar- 
macy; Evolutionary Biology; Microbiology; Im- 
munology; Infectious Diseases 



Biology 



Biology 



Biology-Multidisciplinary 



Biology with: Ecology; Evolutionary Biology; 
Environmental Sciences; Mathematical Sz Com- 
putational Biology 



Biology Sz Geology 



Biology or Environmental Sciences with Geo- 
sciences, Multidisciplinary 



Biotechnology Sz Applied Microbiology 



Biotechnology Sz Applied Microbiology 



Biotechnology Sz 
Multidisciplinary 



Applied Microbiology- 



Biotechnology Sz Applied Microbiology with: 
Food Science Sz Technology; Microbiology; Ge- 
netics Sz Heredity; Marine Sz Freshwater Biology 



Chemistry 



Chemistry; Chemistry, Analytical; Chemistry, 
Physical; Chemistry, Organic; Chemistry, Inor- 
ganic Sz Nuclear 
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Conflated Subject Category WoK Source Subject Category 

Chemistry-Multidisciplinary Chemistry, Multidisciplinary or Chemistry or 

Chemistry, Analytical with: Spectroscopy; 
Chemistry, Medicinal; Nanoscience & Nanotech- 
nology; Materials Science, Multidisciplinary; 
Chemistry, Applied; Computer Science, In- 
formation Systems; Computer Science, In- 
terdisciplinary Applications; Pharmacology & 
Pharmacy; Environmental Sciences; Toxicol- 
ogy; Physics, Condensed Matter; Engineering, 
Chemical; Mathematics & Computational Biol- 
ogy; Oceanography; Nuclear Science & Technol- 
ogy; Polymer Science 



Chemistry Sz Physics 


Chemistry with Physics 


Chemistry & Physics-Multidisciplinary 


Chemistry and Physics with Nuclear Science & 




Technology 


Computer Science 


Computer Science 



Computer Science-Multidisciplinary Computer Science, Multidisciplinary or Com- 



puter Science with: Information Science & 
Library Science; Cybernetics; Computer Sci- 
ence, Artificial Intelligence; Computer Sci- 
ence, Theory & Methods; Engineering, Elec- 
trical & Electronic; Computer Science, Hard- 
ware & Architecture; Computer Science, Infor- 
mation Systems; Computer Science, Interdisci- 
plinary Applications; Geosciences, Multidisci- 
plinary; Physics, Fluids &: Plasmas 



Crystallography 




Crystallography 


Education 




Education or Education with Multidisciplinary 
Sciences 


Engineering 




Engineering; Engineering, Instruments & In- 
strumentation; Engineering, Electrical &: Elec- 
tronic; Engineering, Mechanical 


Environmental Sciences & Ecology 


Environmental Sciences or Ecology 


Environmental Sciences 


& Ecology- 


Environmental Sciences or Ecology with: Lim- 


Multidisciplinary 




nology; Evolutionary Biology; Marine & Fresh- 
water Biology; Microbiology; Oceanography; 
Engineering, Civil; Water Resources; Engineer- 
ing, Environmental; Engineering, Chemical; Ge- 
ology; Meteorology & Atmospheric Sciences; 
Geography, Physical; Geosciences, Multidisci- 
plinary; Soil Science; Toxicology; Fisheries 


Genetics & Heredity 




Genetics & Heredity 


Genetics & Heredity-Multidisciplinary 


Genetics & Heredity with: Ecology; Evolution- 






ary Biology 


Geochemistry & Geophysics 




Geochemistry & Geophysics 
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Conflated Subject Category 



WoK Source Subject Category 



Geochemistry & Geophysics-Multidisciplinary Geochemistry & Geophysics-Multidisciplinary 

or Geochemistry & Geophysics with: Geology; 
Meteorology & Atmospheric Sciences; Mineral- 
ogy; Geography, Physical; Geosciences, Multi- 
disciplinary; Paleontology 



Geography 


Geography, Physical 




Oeolnp'v 


GeoloEfv-Multidiscinlinarv 


Geolotrv or Geosciences Multidiscinlinarv 




with: Energy & Fuels; Engineering, Petroleum; 




Mineralogy; Mining & Mineral Processing; Pa- 




leontology; Geography, Physical; Mathematics, 




Interdisciplinary Applications 


Geology &: Oceanography 


Geology with Oceanography 


Geonhvsics Oceanosranhv 


Geochemistrv Geonhvsics with Oceanoora,- 




Dhv 

r v 


Instruments Instrumentation 


Instruments Instrumentation 

-1 11' ' V A. lllllVjll U UK; 1 lli ' 11 lllllVjll U K-Al 11VJ11 


Tiifp Slpiprippti At" RinTYipHipiinp-A/TiiltiHi^pi nlinarv 

-UHC kJ^lCll^CO Oo 1_J 1U 1 11C Lll^ 111C IV J. Lll Lj1*_11o<_/1|J11110j1 


T,ifp Spiptipp^ at" T^imnprlipi'np OfVipr TVi'nipt;* A/Tnl- 

UHC kJ V/ldlV_/CO UO UHJlllCVJ-lv^lilO 5 V. .7 Li 11C J- -LVJ^JIL-^Q j 1V1. LLI 




tidisciplinary Sciences; Science & Technology, 




Other Topics 


Materials Science 


Materials Science 


Materials Srienre-MnltiHisrinlinarv 


Materials Science MnltiHiscinlinarv or Materi- 




als Science with Physics, Metallurgy Sz Metal- 




lurgical Engineering 


Mathematical & Computational Biology 


Mathematical &: Computational Biology 


Mathematics 


Mathematics; Mathematics, Applied; Statistics 




& Probability 


Medicine 


Medical Sciences; Psychology, Clinical; 




Medicine, General &: Internal; Public, En- 




vironmental & Occupational Health; Sport 




Sciences 


Meteorology & Atmospheric Sciences 


Meteorology & Atmospheric Sciences 


Meteorology & Oceanography 


Meteorology &: Atmospheric Sciences with 




Oceanography 


Microbiology 


Microbiology 


Mineralogy 


Mineralogy 


Multidisciplinary Sciences 


Multidisciplinary Sciences 


Neurosciences 


Neurosciences 


Nutrition & Dietetics 


Nutrition k, Dietetics 


Oceanography 


Oceanography 


Oceanography & Marine Biology 


Oceanography and Marine &z Freshwater Biol- 




ogy or Limnology 


Optics 


Optics 


Optics-Multidisciplinary 


Optics with: Spectroscopy; Engineering, Mul- 




tidisciplinary 


Pharmacology & Pharmacy 


Pharmacology & Pharmacy 
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Conflated Subject Category 


WoK Source Subject Category 


Physics 


Physics; Physics, Fluids & Plasmas 


Physics- Multidisciplinary 


Physics, Multidisciplinary or Physics with: 




Mechanics; Physics, Particles & Fields; Physics, 




Nuclear; Nuclear Science & Technology; 




Physics, Atomic, Molecular & Chemical; Chem- 




istry, Physical; Instruments & Instrumentation; 




Optics; Thermodynamics; Energy & Fuels 


Psychology 


Psychology 


Spectroscopy 


Spectroscopy 


Spectroscopy-Multidisciplinary 


Spectroscopy with: Chemistry, Physical; 




Chemistry, Analytical; Physics, Atomic, Molec- 




ular Sz Chemical 


Virology 


Virology 


Zoology 


Zoology 
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Table 2: The distribution of conflated subject categories, their corresponding abstracts and the 
fraction of abstracts obtained through ADS and WoK. 



Subject Category 


Number of 


Fraction of 


ADS 


WoK 




abstracts 


dataset (%) 


(%) 


(%) 


Astronomy & Astrophysics [Astro] 


6914 


67.68 


98.7 


1.3 


Astronomy & Astrophysics-Multidisciplinary [Astro-M] 


66 


0.65 


98.5 


1.5 


Astrophysics & Geophysics [Astro &; GeoPhys] 


364 


3.56 


93.4 


6.6 


Biochemistry &; Molecular Biology [BioChem & MBio] 


61 


0.6 





100 


Biochemistry & Molecular Biology-Multidisciplinary 
[BioChem & MBio-M] 


109 


1.07 


0.9 


99.1 


Biotechnology & Applied Microbiology-Multidisciplinary 
[BioTech & AMBio-M] 


58 


0.57 





100 


Environmental Sciences & Ecology-Multidisciplinary [EnvSc 
& Eco-M] 


66 


0.65 


1.5 


98.5 


Geochemistry & Geophysics [GeoChem & GeoPhys] 


978 


9.6 


65.1 


34.9 


Geochemistry & Geophysics-Multidisciplinary [GeoChem & 
GeoPhys-M] 


491 


4.8 


45.8 


54.2 


Multidisciplinary Sciences [Multidisciplinary] 


830 


8.12 


78.9 


21.1 


Oceanography 


55 


0.54 





100 


Physics 


86 


0.84 


100 





Physics- Multidisciplinary [Physics- M] 


138 


1.35 


98.6 


1.4 
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Table 3: Home discipline of the authors at the University of Hawaii NASA Astrobiology Institute. 
An asterisk (*) denotes a post-doctoral researcher. 



Author 


Departmental Affiliation/Home Discipline 


Bennett* 


Unemistry 


Binsted 


Computer Science 


Cowen 


Oceanography 


Freeland 


Biology 


Gazan 


Computer hcience 


Haghighipour 


Astronomy 


TT 

Huss 


Geology 


Jewitt 


Astronomy 


Jogo* 


Geology 


Kaiser 


Chemistry 


Keane* 


Astronomy 


Ken 


Geology 


Kleyna* 


Astronomy 


TV J- 

Krot 


Geology 


Meech 


Astronomy 


Mottl 


Oceanography 


Owen 


Astronomy 


Reipurth 


Astronomy 


Riesen* 


Astronomy 


Sarid* 


Astronomy 


Schorghofer 


Astronomy 


Scott 


Geology 


Taylor 


Geology 


Yang* 


Astronomy 



Table 4: Statistics for the UHNAI aggregated abstracts dataset. A sample (N = 100) of the total 
number of UHNAI publications is shown to estimate the completeness of the aggregated abstracts. 





ADS 


WoK 


Total 


Total UHNAI publications in the dataset 


655 (89.6%) 


76 (10.4%) 


731 


Number of publications randomly selected 


88 


12 


100 


Total number of references found across the sample 


3426 


756 


4182 


Total referenced abstracts harvested 


2908 


386 


3294 


Average completeness of the aggregated abstracts 1 


77.5% 


50.3% 


74.3% 



x The average completeness is measured as the mean of the completeness of each individual 
aggregated abstract in the sample. 
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Table 5: Clustering the conflatedSC -default dataset with 5, 10, 15, and 20 clusters. 
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Table 6: Clustering abstracts in the conflatedSC sampled dataset with 5 clusters. 



Astro 


Astro- 


Astro & 


BioChem 


BioChem 


BioTech & 


EnvSc & 


GeoChem & 


GeoChem & 


Multidisc- 


Oceano- 


Physics 


Physics- 




M 


GcoPhys 


& MBio 


& MBio-M 


AMBio-M 


Eco-M 


GeoPhys 


GeoPhys-M 


iplinary 


graphy 




M ' 


Trial 1 


1 0.05 


0.00 


0.00 


0.00 


0.00 


0.00 


0.31 


0.02 


0.01 


0.02 


0.00 


0.76 


0.99 


2 0.00 


0.00 


0.00 


0.00 


0.09 


0.93 


0.68 


0.14 


0.09 


0.05 


1.00 


0.00 


0.00 


3 0.89 


0.98 


0.10 


0.00 


0.01 


0.00 


0.00 


0.07 


0.25 


0.56 


0.00 


0.17 


0.01 


4 0.00 


0.00 


0.00 


1.00 


0.90 


0.07 


0.00 


0.00 


0.00 


0.04 


0.00 


0.00 


0.00 


5 0.06 


0.01 


0.90 


0.00 


0.00 


0.00 


0.01 


0.77 


0.65 


0.33 


0.00 


0.07 


0.00 


Trial 2 


1 0.04 


0.00 


0.88 


0.00 


0.00 


0.00 


0.00 


0.69 


0.21 


0.24 


0.00 


0.08 


0.00 


2 0.61 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.01 


0.10 


0.00 


0.90 


1.00 


3 0.00 


0.00 


0.00 


0.00 


0.09 


0.95 


0.99 


0.14 


0.08 


0.05 


1.00 


0.00 


0.00 


4 0.35 


0.99 


0.12 


0.00 


0.01 


0.00 


0.01 


0.16 


0.70 


0.56 


0.00 


0.03 


0.00 


5 0.00 


0.00 


0.00 


1.00 


0.90 


0.05 


0.00 


0.01 


0.00 


0.04 


0.00 


0.00 


0.00 


Trial 3 


1 0.33 


0.99 


0.07 


0.00 


0.01 


0.00 


0.34 


0.09 


0.31 


0.47 


0.00 


0.00 


0.03 


2 0.05 


0.00 


0.92 


0.00 


0.00 


0.00 


0.00 


0.79 


0.63 


0.34 


0.00 


0.04 


0.00 


3 0.00 


0.00 


0.00 


0.00 


0.09 


0.96 


0.66 


0.11 
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0.05 


1.00 
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0.00 
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0.00 


1.00 


0.90 


0.04 


0.00 


0.00 


0.00 


0.04 


0.00 


0.00 


0.00 


5 0.62 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.01 


0.01 


0.11 


0.00 


0.95 


0.97 



Table 7: Clustering abstracts in the conflatedSC sampled dataset with 10 clusters. 
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0.00 
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0.00 


3 0.08 


0.78 


0.01 


0.00 


0.02 


0.00 


0.00 


0.04 


0.05 


0.13 


0.00 


0.02 


0.07 


4 0.06 


0.17 


0.12 


0.00 


0.00 


0.00 


0.01 


0.19 


0.70 


0.24 


0.00 


0.00 


0.00 


5 0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.14 


0.90 


6 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.88 


0.02 


0.00 


0.00 


0.19 


0.00 


0.00 


7 0.03 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.81 


0.02 


8 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.18 


0.12 


0.04 


0.81 


0.00 


0.00 


9 0.78 


0.05 


0.05 


0.00 


0.00 


0.00 


0.00 


0.02 


0.06 


0.32 


0.00 


0.03 


0.01 


10 0.00 


0.00 


0.00 


0.00 


0.11 


0.99 


0.11 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


Trial 2 


1 0.02 


0.00 


0.09 


0.00 


0.00 


0.00 


0.00 


0.27 


0.71 


0.18 


0.00 


0.00 


0.00 


2 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.87 


0.02 


0.00 


0.00 


0.19 


0.00 


0.00 


3 0.53 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.09 


0.00 


0.76 


0.02 


4 0.03 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.01 


0.00 


0.19 


0.97 


5 0.00 


0.00 


0.00 


0.00 


0.11 


0.99 


0.13 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


6 0.04 


0.72 


0.01 


0.00 


0.03 


0.00 


0.00 


0.03 


0.03 


0.07 


0.00 


0.00 


0.01 


7 0.35 


0.27 


0.10 


0.00 


0.00 


0.00 


0.00 


0.04 


0.11 


0.38 


0.00 


0.03 


0.00 


8 0.02 


0.00 


0.79 


0.00 


0.00 


0.00 


0.00 


0.52 


0.06 


0.18 


0.00 


0.02 


0.00 


9 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.11 


0.09 


0.03 


0.81 


0.00 


0.00 


10 0.00 


0.00 


0.00 


1.00 


0.86 


0.01 


0.00 


0.00 


0.00 


0.03 


0.00 


0.00 


0.00 


Trial 3 


1 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.59 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


2 0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.01 


0.00 


0.16 


0.96 


3 0.07 


0.96 


0.03 


0.00 


0.02 


0.00 


0.00 


0.03 


0.05 


0.13 


0.00 


0.01 


0.01 


4 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.29 


0.03 


0.02 


0.01 


1.00 


0.00 


0.00 


5 0.00 


0.00 


0.00 


0.00 


0.13 


0.99 


0.12 


0.01 


0.00 


0.02 


0.00 


0.00 


0.00 


6 0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.80 


0.01 


7 0.82 


0.04 


0.07 


0.00 


0.00 


0.00 


0.00 


0.02 


0.08 


0.36 


0.00 


0.02 


0.01 


8 0.03 


0.00 


0.08 


0.00 


0.00 


0.00 


0.00 


0.34 


0.78 


0.23 


0.00 


0.00 


0.00 


9 0.03 


0.00 


0.83 


0.00 


0.00 


0.00 


0.00 


0.53 


0.06 


0.20 


0.00 


0.02 


0.00 


10 0.00 


0.00 


0.00 


1.00 


0.84 


0.01 


0.00 


0.00 


0.00 


0.03 


0.00 


0.00 


0.00 



Table 8: Clustering abstracts in the conflatedSC sampled dataset with 15 clusters. 



Astro 


Astro- 


Astro & 


BioChem 


BioChem 


BioTech & 


EnvSc & 


GeoChem & 


GeoChem & 


Multidisc- 


Oceano- 


Physics 


Physics- 




M 


GeoPhys 


& MBio 


& MBio-M 


AMBio-M 


Eco-M 


GeoPhys 


GeoPhys-M 


iplinary 


graphy 




M ' 


Trial 1 


1 0.00 


0.00 


0.00 


0.58 


0.31 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


2 0.03 


0.00 


0.64 


0.00 


0.00 


0.00 


0.00 


0.46 


0.05 


0.16 


0.00 


0.00 


0.00 


3 0.00 


0.00 


0.00 


0.00 


0.12 


0.99 


0.06 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


4 0.02 


0.66 


0.01 


0.00 


0.02 


0.00 


0.00 


0.01 


0.01 


0.04 


0.00 


0.00 


0.01 


5 0.06 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.04 


0.00 


0.46 


0.10 


6 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.02 


0.85 


7 0.01 


0.00 


0.22 


0.00 


0.00 


0.00 


0.00 


0.28 


0.54 


0.15 


0.00 


0.00 


0.00 


8 0.00 


0.00 


0.00 


0.42 


0.55 


0.01 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


9 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.44 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


10 0.10 


0.20 


0.03 


0.00 


0.00 


0.00 


0.00 


0.07 


0.29 


0.24 


0.00 


0.00 


0.00 


11 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


0.45 


0.03 


12 0.54 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.07 


0.00 


0.03 


0.01 


13 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.50 


0.02 


0.00 


0.00 


0.20 


0.00 


0.00 


14 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.12 


0.10 


0.03 


0.80 


0.00 


0.00 


15 0.21 


0.14 


0.08 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.21 


0.00 


0.03 


0.00 


Trial 2 


1 0.00 


0.00 


0.00 


0.24 


0.24 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


2 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.44 


0.03 


0.00 


0.00 


0.00 


0.00 


0.00 


3 0.00 


0.00 


0.00 


0.51 


0.27 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


4 0.03 


0.01 


0.08 


0.00 


0.00 


0.00 


0.00 


0.25 


0.73 


0.22 


0.00 


0.00 


0.00 


5 0.03 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.17 


0.96 


6 0.00 


0.00 


0.00 


0.09 


0.36 


0.02 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


7 0.00 


0.00 


0.00 


0.00 


0.07 


0.34 


0.24 


0.02 


0.00 


0.03 


0.01 


0.00 


0.00 


8 0.04 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.79 


0.01 


9 0.00 


0.00 


0.00 


0.16 


0.04 


0.43 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


10 0.06 


0.65 


0.01 


0.00 


0.00 


0.00 


0.00 


0.03 


0.04 


0.10 


0.00 


0.00 


0.02 


11 0.16 


0.33 


0.10 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.17 


0.00 


0.03 


0.00 


12 0.02 


0.00 


0.80 


0.00 


0.00 


0.00 


0.00 


0.53 


0.06 


0.18 


0.00 


0.01 


0.00 


13 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.11 


0.09 


0.02 


0.80 


0.00 


0.00 


14 0.00 


0.00 


0.00 


0.00 


0.01 


0.21 


0.31 


0.00 


0.00 


0.00 


0.20 


0.00 


0.00 


15 0.66 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.07 


0.22 


0.00 


0.00 


0.01 


Trial 3 


1 0.00 


0.00 


0.00 


0.00 


0.12 


0.99 


0.06 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


2 0.08 


0.00 


0.04 


0.00 


0.00 


0.00 


0.00 


0.02 


0.02 


0.06 


0.00 


0.32 


0.13 


3 0.61 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.10 


0.00 


0.00 


0.01 


4 0.10 


0.06 


0.02 


0.00 


0.00 


0.00 


0.01 


0.07 


0.31 


0.26 


0.00 


0.00 


0.00 


5 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.50 


0.00 


0.00 


0.00 


0.21 


0.00 


0.00 


6 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.03 


0.85 
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U.UU 
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U.UU 
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U.UU 


n no 
U.Uz 


n nn 
U.UU 


n nn 
U.UU 


n nn 
U.UU 


11 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.13 


0.09 


0.03 


0.79 


0.00 


0.00 


12 


0.03 


0.56 


0.01 


0.00 


0.02 


0.00 


0.00 


0.01 


0.01 


0.04 


0.00 


0.00 


0.00 


13 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.62 


0.01 


14 


0.00 


0.00 


0.00 


0.54 


0.14 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


15 


0.12 


0.37 


0.09 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.15 


0.00 


0.02 


0.00 



00 



Table 9: Clustering abstracts in the conflatedSC sampled dataset with 20 clusters. 



Astro 


Astro- 


Astro & 


BioChem 


BioChem 


BioTech & 


EnvSc & 


GeoChem & 


GeoChem & 


Multidisc- 


Oceano- 


Physics 


Physics- 




M 


GcoPhys 


& MBio 


& MBio-M 


AMBio-M 


Eco-M 


GeoPhys 


GeoPhys-M 


iplinary 


graphy 




M ' 


Trial 1 


1 0.01 


0.00 


0.23 


0.00 


0.00 


0.00 


0.00 


0.25 


0.08 


0.07 


0.00 


0.00 


0.00 


2 0.10 


0.05 


0.05 


0.00 


0.00 


0.00 


0.00 


0.07 


0.08 


0.18 


0.00 


0.00 


0.08 


3 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.42 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


4 0.02 


0.00 


0.10 


0.00 


0.00 


0.00 


0.00 


0.05 


0.29 


0.06 


0.00 


0.00 


0.00 


5 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.35 


0.00 


0.00 


0.00 


0.21 


0.00 


0.00 


6 0.00 


0.00 


0.00 


0.07 


0.21 


0.03 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


7 0.53 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.06 


0.00 


0.00 


0.01 


8 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.08 


0.08 


0.03 


0.53 


0.00 


0.00 


9 0.14 


0.14 


0.04 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.12 


0.00 


0.04 


0.00 


10 0.00 


0.00 


0.00 


0.34 


0.25 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


11 0.02 


0.03 


0.01 


0.00 


0.00 


0.00 


0.01 


0.09 


0.37 


0.09 


0.00 


0.00 


0.00 


10 n nn 
1/ u.uu 


n nn 
u.uu 


n nn 
u.uu 


n nn 
u.uu 


n n^ 

U.Uo 
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u.yu 
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U.UO 


n nn 
u.uu 


n nn 
u.uu 


n m 

U.Ul 


n nn 
u.uu 


n nn 
u.uu 


n nn 
u.uu 


13 0.00 


0.00 


0.00 


0.59 


0.51 


0.00 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


14 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.35 


0.06 


15 0.01 


0.57 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


16 0.02 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.59 


0.01 


17 0.02 


0.00 


0.56 


0.00 


0.00 


0.00 


0.00 


0.39 


0.02 


0.12 


0.00 


0.00 


0.00 


18 0.12 


0.21 


0.01 


0.00 


0.00 


0.00 


0.00 


0.02 


0.08 


0.18 


0.00 


0.00 


0.00 


19 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.84 


20 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.17 


0.03 


0.00 


0.01 


0.26 


0.00 


0.00 


Trial 2 


1 0.00 


0.00 


0.00 


0.19 


0.18 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


2 0.02 


0.00 


0.76 


0.00 


0.00 


0.00 


0.00 


0.51 


0.05 


0.17 


0.00 


0.00 


0.00 


3 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.40 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


4 0.00 


0.00 


0.00 


0.16 


0.21 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


5 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.61 


6 0.10 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.02 


0.05 


0.00 


0.15 


0.16 


7 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.14 


0.23 


8 0.01 


0.62 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


9 0.00 


0.00 


0.00 


0.09 


0.36 


0.00 


0.00 


0.00 


0.00 


0.02 


0.00 


0.00 


0.00 


10 0.00 


0.00 


0.00 


0.35 


0.06 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


11 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.03 


0.00 


0.01 


0.32 


0.00 


0.00 


12 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.50 


0.00 


0.00 


0.00 


0.17 


0.00 


0.00 


13 0.18 


0.13 


0.02 


0.00 


0.00 


0.00 


0.00 


0.06 


0.21 


0.30 


0.00 


0.00 


0.00 


14 0.00 


0.00 


0.00 


0.21 


0.10 


0.02 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


15 0.02 


0.00 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.68 


0.00 


16 0.63 


0.04 


0.04 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.16 


0.00 


0.02 


0.01 


17 0.00 


0.00 


0.00 


0.00 


0.08 


0.96 


0.08 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


18 0.01 


0.00 


0.03 


0.00 


0.00 


0.00 


0.01 


0.27 


0.53 


0.16 


0.00 


0.00 


0.00 



Astro 


Astro- 


Astro & 


BioChem 


BioChem 


BioTech & 


EnvSc & 


GeoChcm & 


GeoChem & 


Multidisc- 


Oceano- 


Physics 


Physics- 




M 


GcoPhys 


& MBio 


& MBio-M 


AMBio-M 


Eco-M 


GeoPhys 


GeoPhys-M 


iplinary 


graphy 




M 


19 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.03 


0.03 


0.01 


0.50 


0.00 


0.00 


20 0.03 


0.20 


0.14 


0.00 


0.00 


0.00 


0.00 


0.05 


0.17 


0.07 


0.00 


0.00 


0.00 


Trial 3 


1 0.00 


0.00 


0.00 


0.42 


0.30 


0.00 


0.00 


0.00 


0.00 


0.01 


0.00 


0.00 


0.00 


2 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.70 


3 0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.41 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


4 0.00 


0.00 


0.00 


0.35 


0.26 


0.02 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


5 0.04 


0.19 


0.01 


0.00 


0.05 


0.00 


0.01 


0.04 


0.01 


0.12 


0.00 


0.00 


0.00 


6 0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.43 


0.01 


7 0.14 


0.07 
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